1 Objective

This notebook demonstrates an Environment‐Wide Association Study (EnWAS) work flow shown in Figure 1.1.

EnWAS Work Flow

Figure 1.1: EnWAS Work Flow

In this example, we are interested in finding nutrition predictors associated with diastolic blood pressure. To show work with the above steps, we will perform a linear regression model and a linear regression with applying natural spline functions on the continuous variables. In short, a spline-based linear regression model can capture the non-linearity trend of the data by using piece-wise polynomial functions on continuous variables. More information about spline can be found on wikipedia

2 Identify a response and a set of confounders that will be adjusted for

2.1 Identify The Data Source

We will use the NHANES as our data source more details about data can be found on the CDC website. A quick start notebook about accessing the data in our docker database with provided R function shown in here.

2.2 Identify a response, or outcome of interest.

2.3 FIXME: you probably want to specify some “entry” criteria

2.4 age and sex, what if any restrictions, what about comorbidities

2.5 what about people taking drugs to lower their blood pressure?

2.6 do we want to deal with obesity - would we just use “Normal BMI”

2.7 Most of this information might go into a Methods Section, and some of the

2.8 analytic details would go in to a Supplement.

2.9 describe the specifics of the processing in the methods section.

2.10 Something like - For individuals with two measurements we averaged the two,

2.11 for those with only one, we used the one, all others would be missing.

2.12

In this demonstration, we choose diastolic blood pressure as the outcome interest. These data are contained in tables in the NHANES database.

For each survey a number of people were chosen to provide blood pressure measurements and these were obtained in duplicate on two separate days. To get an more accurate measurement of the diastolic blood pressure, we average two times of measurements for each participants.

2.13 Identify a set of confounders that will be adjusted for. These are typically not of interest themselves, but are important established confounders such as age or sex that should be adjusted for.

2.14 FIXME: in the Methods section describe any transformations, or details of the data - ##especially which tables they come from, for example.

2.15 FIXME: here you need to put in citations for the literature you used - and use the papers ## to explain why you chose these variables. We could also do univiariate model fitting to ## each of them and report it in the supplement just to show that they have some effect on BP ## on their own. We also need to explain why we are using systolic or diastolic…

Based on our literature review we feel that sex, age, ethnicity, education and BMI are likely to affect blood pressure, and we choose them as the confounders to adjust for. In a more realistic analysis we would likely also want to find out who is taking medications that are intended to alter blood pressure, but for simplicity of the exposition we will ignore that issue in this analysis.

Normally the model fitting and QA/QC process would be carried out in an exploratory and iterative fashion, however that is not easy to capture in a static document. Instead we will compare two basic models, one where we assume that there is a straight line relationship between continuous confounders and our response, and a second approach where we use spline models. This approach has some similarities to the Generalized Additive Model (GAMs) and much of the literature there is useful. There are some differences though, and we will outline them as we go.

2.16 Load data that contains responses and confounders, and preprocess using PHONTO

We load the NHANES data using functions from PHONTO package. The PHONTO package provides a set function to facilitate researchers to get access NHANES data in our docker database. More detail about data query and search tools can be found on quick start page. In addition, the phesant() function allows the user to check the data inspired by the PHESANT package.

cols = list(DEMO_C=c("RIDAGEYR","RIAGENDR","RIDRETH1","DMDEDUC2"), 
            DEMO_D=c("RIDAGEYR","RIAGENDR","RIDRETH1","DMDEDUC2"), 
            DEMO_E=c("RIDAGEYR","RIAGENDR","RIDRETH1","DMDEDUC2"), 
            BPX_C=c('BPXDI1','BPXDI2'),
            BPX_D=c('BPXDI1','BPXDI2'),
            BPX_E=c('BPXDI1','BPXDI2'),
            BMX_C="BMXBMI",
            BMX_D="BMXBMI",
            BMX_E="BMXBMI"
            )
base_df <- jointQuery(cols) 

base_df |> DT::datatable(options=list(pageLength=5))
#> Warning in instance$preRenderHook(instance): It seems your data is too big
#> for client-side DataTables. You may consider server-side processing: https://
#> rstudio.github.io/DT/server.html

Transform education and BMI according CDC

# remove age under 20 and diastolic with 0s
base_df = base_df |> subset(RIDAGEYR>20 & BPXDI1!=0 & BPXDI2!=0 )
# assign the gender and ethnicity to the real levels
# Transform education to:
#  < High School
#   - High School
#  > High School
trans_DEDUC = ifelse(stringr::str_detect(base_df$DMDEDUC2,"9"), "<HS",
                
                ifelse(stringr::str_detect(base_df$DMDEDUC2,"High School"), "HS", 
                ifelse(stringr::str_detect(base_df$DMDEDUC2,'College'), ">HS", NA)))
base_df$DMDEDUC2 <- as.factor(trans_DEDUC)

# transform the BMI to:
#  underweight range.
#  healthy weight range.
#  overweight range.
#  obesity range.
trans_BMI = ifelse(base_df$BMXBMI < 18.5, "underweight", 
                   ifelse(base_df$BMXBMI < 25, "healthy", 
                   ifelse(base_df$BMXBMI < 30, "overweight", 
                   ifelse(base_df$BMXBMI >= 30, "obesity",     
                          NA))))

base_df$BMXBMI = as.factor(trans_BMI)
base_df$years = as.factor(paste0(base_df$Begin.Year,"-",base_df$EndYear))

base_df = na.omit(base_df)
# Take average first and second read for the diastolic blood pressure.
base_df$DIASTOLIC <- (base_df$BPXDI1+base_df$BPXDI2)/2

3 Build Baseline Model

To find plausible models, we should build models and perform QA/QC process iteratively. In the following demonstrations, we built a linear regression model and another linear model with apply natural spline function on the continuous variables. The outcome is diastolic is the average of the diastolic first (BPXDI1) and second (BPXDI2) reads, gender(RIAGENDR), age (RIDAGEYR), ethnicity (RIDRETH1), BMI(BMXBMI) and education(DMDEDUC2).

ns_str <-
  'DIASTOLIC ~ ns(RIDAGEYR, knots = seq(30, 80, by = 10), Boundary.knots=c(20,90)) * RIAGENDR + BMXBMI + RIDRETH1+DMDEDUC2+years'
ns_base <- lm(formula = as.formula(ns_str), base_df)

4 QA/QC for Base Model

We need to check the base model and ensure it runs correctly before performing EnWAS. However, the classical methods such as Q-Q plots, residual plots, and goodness of fit (GoF) tests are generally ill-suited. We want to provide the following plots to demonstrate the criteria of QA/QC.

It is clear to see that the diastolic blood pressure has a parabola-like shape relation with the age, as shown in Figure 4.1. The yellow and blue lines are generated by smooth prediction from linear and spline models. The dots are randomly sampled in 20% of the data points.

Linear vs Spline

Figure 4.1: Linear vs Spline

4.1 Residuals vs. Covarites and Fitted Values

The transnational methods cannot detect the linear model fit data poorly.

We can check the residuals against the fitted value with a smooth scatter plot. And we find that there are no apparent trends for both models, even though the spline model has fewer mean square errors.

A possible solution to check the “goodness of fit (GoF)” is to check whether apparent trends in the plots of residual against terms in the models. We can spot a slight trend residual in the BMI range from 20 to 40, indicating that using linear regression on BMI term may not hurt the model performance too much. However, a strong parabola-like trend can be observed in the residuals of the linear model with respect to ages, which indicates that the linear model cannot capture age. In other words, the model is not good enough to be a base model to run EnWAS; the findings are more likely false positives if using such a base model. On the other hand, the residuals spline model has no clear trends with respect to both BMI and age, which means the base model captures the relations of outcomes (diastolic) and the known confounders.

We can further check the base models with binned plots, which can be helpful when the data set is large. The binned plot is a way that “zoom in” to look at the trends.

Figure 4.2 shows QA/QC plots.

  • a) Smooth scatter plots for residuals with respect to fitted values, and there is no strong pattern in both cases even though the spline has less mean square error than the linear model.
  • b) Smooth scatter plots for residuals with respect to age variable, and the linear model has a parabola-like pattern, whereas no obvious pattern for the spline model.
  • c) Binned plots for residuals with respect to age variable to look at the trends of residuals against age.
Quality Assurance and Quality Control

Figure 4.2: Quality Assurance and Quality Control

5 Identify a set of phenotypes and exposures

##FIXME: again in the methods section you should give some explicit information ## about these variables.

Identify a set of phenotypes and exposures that we want to test for association with the response variable of interest. In this example, we choose phenotype from the dietary data, we identified the phenotypes and saved them into a built-in variable call exposure_vars in EnWAS package. “The objective of the dietary interview component is to obtain detailed dietary intake information from NHANES participants. The dietary intake data are used to estimate the types and amounts of foods and beverages (including all types of water) consumed during the 24-hour period prior to the interview (midnight to midnight), and to estimate intakes of energy, nutrients, and other food components from those foods and beverages.” More detail can be found code book.

cols = list("DR1TOT_C"=exposure_vars,"DR1TOT_D"=exposure_vars,"DR1TOT_E"=exposure_vars)
diet_data = unionQuery(cols)
dim(diet_data)
#> [1] 29355    52

Once we load the dietary data, we should also check it with the PHESANT function and preprocess it if necessary. We may need to eplore the missing data for the above data according to the notebook. Currently, we only keep the columns that has less missing rate less than 15%. The observations with missing values in the covariates in the baseline model or current phenotype will be remove during EnWAS.

5.1 Transformation

In NHANES, the phenotypes are often right-skewed with a very long tail on the right side of the distribution, which can be addressed with logarithm transformation followed by a z-transformation. However, it would take a tremendous effort to manually and exhaustively inspect the distributions and figure out appropriate transformation methods for all kinds of phenotypes with different types of distribution when dealing with extensive data sets. Therefore, we recommend using inverse normal transformation (INT) for all continuous variables because INT can apply various distributions. We compared the EnWAS results of logarithm transformation followed by a z-transformation with normal inverse transformation.

Figure 5.1 shows an example of Total fat (DR1TTFAT) with non transformation, log-z transformation, and inverse normal transformation.

Transformation for Total fat (gm)

Figure 5.1: Transformation for Total fat (gm)

6 Carry out a set of regression models.

In this step, we need to carry out a set of regression models, one for each exposure/phenotype in Step 5, and report on the analysis.

data = merge(base_df,diet_data,by = "SEQN")
data = na.omit(data)
xwas = enwas(ns_str,exposure_vars,data)
#> Warning in is.data.frame(x): NAs introduced by coercion

#> Warning in is.data.frame(x): NAs introduced by coercion

#> Warning in is.data.frame(x): NAs introduced by coercion
xwas_log <- enwas(ns_str,exposure_vars,data,trans = "log")
xwas_inv <- enwas(ns_str,exposure_vars,data,trans = "inv")

Figure 6.1 shows the estimates and CI of the exposure variables and only displays the top 10 (if they have more than 30 satisfy the filters) ranked by absolute values of the estimates. The variables with their CI containing zeros are also removed.

forest_plot(xwas$enwas_res,10) # filter out CI contains 0
Forest Plot for non tranformation EnWAS

Figure 6.1: Forest Plot for non tranformation EnWAS

Figure 6.2 shows the top 10 exposures, ranked by the differences in the estimates for the same variables. - ns denotes the variables non-transformed, but the estimates with beta^hat * SD(X) - ns_inv denotes variables transformed inverse normal transformation - ns-log denotes variables transformed with log followed by z-transformation

Multiple Forest Plots

Figure 6.2: Multiple Forest Plots

The following scatter plot shows the inverse normal transformation estimates against estimates (beta^hat * SD(X)) of nontransformed variables. The top 20 has added text for the variables, but it is pretty clear to show the information.

Figure 6.3 estimates of EnWAS non-transformed (EnWAS) with inverse normal transformation (EnWAS INT). The the top 20 most difference variables label with variable names.

Scatter Plot for EnWAS against EnWAS with INT

Figure 6.3: Scatter Plot for EnWAS against EnWAS with INT

We can also compare the non-transformed with log-z transformation, and log-z transformation with inverse normal transformation.